{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "d70a9a1f4a86"
},
"source": [
"Fabrício Carraro - @fabriciocarraro - https://github.com/fabriciocarraro"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Tce3stUlHN0L"
},
"source": [
"##### Copyright 2024 Google LLC."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "tuOe1ymfHZPu"
},
"outputs": [],
"source": [
"# @title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5f331c2c21cb"
},
"source": [
"# RAG - PDF Search in multiple documents using Gemma 2 2B on Colab\n",
"\n",
"This project demonstrates a pipeline for extracting, processing, and querying text data from PDF documents on Google Colab using natural language processing (NLP) techniques and Google's open-source model Gemma 2 2B. The system allows users to input a query, which is then answered based on the content of the PDFs."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github"
},
"source": [
"<table align=\"left\">\n",
" <td>\n",
" <a target=\"_blank\" href=\"https://colab.research.google.com/github/google-gemini/gemma-cookbook/blob/main/Gemma/[Gemma_2]RAG_PDF_Search_in_multiple_documents_on_Colab.ipynb\"><img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />Run in Google Colab</a>\n",
" </td>\n",
"</table>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9e6c4596135f"
},
"source": [
"## Setup\n",
"### Select the Colab runtime\n",
"\n",
"To complete this tutorial, you'll need to have a Colab runtime with sufficient resources to run the Gemma model. In this case, you can use a T4 GPU:\n",
"\n",
"1. In the upper-right of the Colab window, select **▾ (Additional connection options)**.\n",
"2. Select **Change runtime type**.\n",
"3. Under **Hardware accelerator**, select **T4 GPU**.\n",
"\n",
"### Gemma 2 setup on Hugging Face\n",
"\n",
"This cookbook uses Gemma 2B instruction tuned model through Hugging Face. So you will need to:\n",
"\n",
"* Get access to Gemma 2 on [huggingface.co](huggingface.co) by accepting the Gemma 2 license on the Hugging Face page of the specific model, i.e., [Gemma 2B IT](https://huggingface.co/google/gemma-2-2b-it).\n",
"* Generate a [Hugging Face access token](https://huggingface.co/docs/hub/en/security-tokens) and configure it as a Colab secret 'HF_TOKEN'.\n",
"\n",
"## Retrieval-Augmented Generation (RAG)\n",
"\n",
"Large Language Models (LLMs) can learn new abilities without directly being trained on them. However, LLMs have been known to \"hallucinate\" when tasked with providing responses for questions they have not been trained on. This is partly because LLMs are unaware of events after training. It is also very difficult to trace the sources from which LLMs draw their responses from. For reliable, scalable applications, it is important that an LLM provides responses that are grounded in facts and is able to cite its information sources.\n",
"\n",
"A common approach used to overcome these constraints is called Retrieval Augmented Generation (RAG), which augments the prompt sent to an LLM with relevant data retrieved from an external knowledge base through an Information Retrieval (IR) mechanism. The knowledge base can be your own corpora of documents, databases, or APIs."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "69d82e84067b"
},
"source": [
"## Installing and importing dependencies"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "_2ac23WaxVGR"
},
"outputs": [],
"source": [
"!pip install -q transformers sentence_transformers faiss-cpu torch PyPDF2 nltk"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "HxlPGnyo0lei"
},
"outputs": [],
"source": [
"import torch\n",
"from transformers import AutoTokenizer, AutoModelForCausalLM\n",
"from sentence_transformers import SentenceTransformer\n",
"import faiss\n",
"import numpy as np\n",
"import pandas as pd\n",
"import PyPDF2\n",
"import os\n",
"import nltk\n",
"nltk.download('punkt')\n",
"from nltk.tokenize import sent_tokenize\n",
"from google.colab import userdata"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "a2694611f02f"
},
"source": [
"## Setting up the model and tokenizer"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "rQulEL_Ry-J4"
},
"outputs": [],
"source": [
"HUGGING_FACE_ACCESS_TOKEN = userdata.get('HF_TOKEN')\n",
"\n",
"model_name = 'google/gemma-2-2b-it'\n",
"\n",
"model = AutoModelForCausalLM.from_pretrained(\n",
" model_name,\n",
" torch_dtype=torch.float16,\n",
" token=HUGGING_FACE_ACCESS_TOKEN\n",
" ).to('cuda')\n",
"\n",
"tokenizer = AutoTokenizer.from_pretrained(model_name, token=HUGGING_FACE_ACCESS_TOKEN)"
]
},
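{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (an optional addition, not part of the original pipeline), you can run a short prompt through the freshly loaded model before building the RAG components."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: generate a short completion to confirm the model and tokenizer work.\n",
"test_prompt = \"Write a one-sentence summary of what Retrieval-Augmented Generation is.\"\n",
"test_ids = tokenizer(test_prompt, return_tensors=\"pt\").input_ids.to('cuda')\n",
"\n",
"with torch.no_grad():\n",
"    test_output = model.generate(test_ids, max_new_tokens=60)\n",
"\n",
"print(tokenizer.decode(test_output[0], skip_special_tokens=True))"
]
},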
{
"cell_type": "markdown",
"metadata": {
"id": "b664e2ecaed2"
},
"source": [
"## Extracting and tokenizing info from the PDF files\n",
"\n",
"The `extract_text_from_pdf()` function will look for all PDF files in the `pdf_path` folder.\n",
"\n",
"The `split_text_into_chunks()` function gets the text and breaks it down into smaller chunks."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "-JBcrxgxx9Fg"
},
"outputs": [],
"source": [
"def extract_text_from_pdf(pdf_path):\n",
" try:\n",
" with open(pdf_path, 'rb') as file:\n",
" reader = PyPDF2.PdfReader(file)\n",
" text = \"\".join([page.extract_text() for page in reader.pages])\n",
" return text\n",
" except Exception as e:\n",
" print(f\"Error reading {pdf_path}: {e}\")\n",
" return \"\"\n",
"\n",
"def split_text_into_chunks(text, max_chunk_size=1000):\n",
" sentences = sent_tokenize(text)\n",
" chunks = []\n",
" current_chunk = \"\"\n",
"\n",
" for sentence in sentences:\n",
" if len(current_chunk) + len(sentence) <= max_chunk_size:\n",
" current_chunk += sentence + \" \"\n",
" else:\n",
" chunks.append(current_chunk.strip())\n",
" current_chunk = sentence + \" \"\n",
"\n",
" if current_chunk:\n",
" chunks.append(current_chunk.strip())\n",
"\n",
" return chunks"
]
},
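{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see how the chunking behaves, you can run `split_text_into_chunks()` on a short sample string with a small `max_chunk_size`. This cell is an illustrative addition; the sample text below is made up and not taken from your PDFs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative example: chunk a small sample text with a tiny max_chunk_size\n",
"# so the sentence-based splitting behavior is easy to see.\n",
"sample_text = (\n",
"    \"Gemma 2 is a family of open models. \"\n",
"    \"This notebook retrieves relevant passages from PDFs. \"\n",
"    \"The retrieved passages are passed to the model as context.\"\n",
")\n",
"\n",
"for i, chunk in enumerate(split_text_into_chunks(sample_text, max_chunk_size=60)):\n",
"    print(f\"Chunk {i}: {chunk}\")"
]
},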
{
"cell_type": "markdown",
"metadata": {
"id": "a365c502d491"
},
"source": [
"## Extracting info from the PDFs\n",
"\n",
"Set the variable `pdf_directory` with the path where your PDF files are. In this case, running on Google Colab, they would be in a folder called `PDFs`, that can be created on the content area on the left side.\n",
"\n",
"A Pandas DataFrame is created containing the path of the corresponding PDF, its chunks and the embedding vector of its chunks."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "jjPD3LOfHYdg"
},
"outputs": [],
"source": [
"encoder = SentenceTransformer('all-MiniLM-L6-v2')\n",
"\n",
"# Process PDF files\n",
"pdf_directory = \"/content/PDFs/\"\n",
"df_documents = pd.DataFrame(columns=['path', 'text_chunks', 'embeddings'])\n",
"\n",
"for filename in os.listdir(pdf_directory):\n",
" if filename.endswith(\".pdf\"):\n",
" print(filename)\n",
" pdf_path = os.path.join(pdf_directory, filename)\n",
" text = extract_text_from_pdf(pdf_path)\n",
" chunks = split_text_into_chunks(text)\n",
" document_embeddings = encoder.encode(chunks)\n",
" new_row = pd.DataFrame({'path': [pdf_path], 'text_chunks': [chunks], 'embeddings': [document_embeddings]})\n",
" df_documents = pd.concat([df_documents, new_row], ignore_index=True)\n",
"\n",
"df_documents"
]
},
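{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, you can check how many chunks and embeddings each PDF produced. This inspection cell is an addition to the original flow and assumes the `PDFs` folder contained at least one file."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional inspection: number of chunks and embedding matrix shape per document.\n",
"for _, row in df_documents.iterrows():\n",
"    print(f\"{row['path']}: {len(row['text_chunks'])} chunks, embeddings shape {row['embeddings'].shape}\")"
]
},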
{
"cell_type": "markdown",
"metadata": {
"id": "aae677d77756"
},
"source": [
"## Creating a FAISS index from all document embeddings\n",
"\n",
"Faiss is a library for efficient similarity search and clustering of vectors. The IndexFlatL2 algorithm will be applied to all chunk embedding vectors."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Vsp1NhGNPyJb"
},
"outputs": [],
"source": [
"all_embeddings = np.vstack(df_documents['embeddings'].tolist())\n",
"dimension = all_embeddings.shape[1]\n",
"index = faiss.IndexFlatL2(dimension)\n",
"index.add(all_embeddings)"
]
},
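{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (an addition to the original flow), the number of vectors in the index should match the total number of chunks, and searching with the first chunk's own embedding should return that chunk with a near-zero distance."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check on the index: vector count and a self-search with the first chunk.\n",
"total_chunks = sum(len(chunks) for chunks in df_documents['text_chunks'])\n",
"print(f\"Vectors in index: {index.ntotal} (expected {total_chunks})\")\n",
"\n",
"first_embedding = all_embeddings[:1]  # embedding of the very first chunk\n",
"distances, indices = index.search(first_embedding, 1)\n",
"print(f\"Nearest neighbor of chunk 0: index {indices[0][0]}, distance {distances[0][0]:.4f}\")"
]
},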
{
"cell_type": "markdown",
"metadata": {
"id": "1a1df3e91076"
},
"source": [
"## Calculating the embedding distance and generating an answer \n",
"\n",
"The `find_most_similar_chunks()` function will create an embedding vector for your query and compare its similarity to all the chunks it retrieved from the PDF files, returning the most similar one, which will be used as the context for the next function.\n",
"\n",
"The `generate_response()` function will generate an answer using our selected model (Gemma 2 2B) based on the context retrieved from the most similar info chunk."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "6TxTcBlnHo_v"
},
"outputs": [],
"source": [
"def find_most_similar_chunks(query, top_k=3):\n",
" query_embedding = encoder.encode([query])\n",
" distances, indices = index.search(query_embedding, top_k)\n",
" results = []\n",
" total_chunks = sum(len(chunks) for chunks in df_documents['text_chunks'])\n",
" for i, idx in enumerate(indices[0]):\n",
" if idx < total_chunks:\n",
" doc_idx = 0\n",
" chunk_idx = idx\n",
" while chunk_idx >= len(df_documents['text_chunks'].iloc[doc_idx]):\n",
" chunk_idx -= len(df_documents['text_chunks'].iloc[doc_idx])\n",
" doc_idx += 1\n",
" results.append({\n",
" 'document': df_documents['path'].iloc[doc_idx],\n",
" 'chunk': df_documents['text_chunks'].iloc[doc_idx][chunk_idx],\n",
" 'distance': distances[0][i]\n",
" })\n",
" return results\n",
"\n",
"def generate_response(query, context, max_length=1000):\n",
" prompt = f\"Context: {context}\\n\\nQuestion: {query}\\n\\nAnswer:\"\n",
" input_ids = tokenizer(prompt, return_tensors=\"pt\").input_ids.to('cuda')\n",
"\n",
" with torch.no_grad():\n",
" output = model.generate(input_ids, max_new_tokens=max_length, num_return_sequences=1)\n",
"\n",
" decoded_output = tokenizer.decode(output[0], skip_special_tokens=True)\n",
"\n",
" # Extracting the answer part by removing the prompt portion\n",
" answer_start = decoded_output.find(\"Answer:\") + len(\"Answer:\")\n",
" answer = decoded_output[answer_start:].strip()\n",
"\n",
" return answer\n",
"\n",
"def query_documents(query):\n",
" similar_chunks = find_most_similar_chunks(query)\n",
" context = \" \".join([result['chunk'].replace(\"\\n\", \"\") for result in similar_chunks])\n",
" response = generate_response(query, context)\n",
" return response, similar_chunks"
]
},
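{
"cell_type": "markdown",
"metadata": {},
"source": [
"Gemma 2's instruction-tuned checkpoints are trained with a specific chat format. The cell below is an optional variant of `generate_response()` (an addition, not part of the original pipeline) that wraps the same context and question in the model's chat template via `tokenizer.apply_chat_template()`. Depending on your documents, this may produce cleaner answers than the raw `Context/Question/Answer` prompt."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional variant: use the tokenizer's chat template instead of a plain text prompt.\n",
"def generate_response_chat(query, context, max_length=1000):\n",
"    messages = [\n",
"        {\"role\": \"user\", \"content\": f\"Context: {context}\\n\\nQuestion: {query}\\n\\nAnswer based only on the context.\"}\n",
"    ]\n",
"    input_ids = tokenizer.apply_chat_template(\n",
"        messages, add_generation_prompt=True, return_tensors=\"pt\"\n",
"    ).to('cuda')\n",
"\n",
"    with torch.no_grad():\n",
"        output = model.generate(input_ids, max_new_tokens=max_length)\n",
"\n",
"    # Decode only the newly generated tokens (everything after the prompt).\n",
"    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True).strip()"
]
},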
{
"cell_type": "markdown",
"metadata": {
"id": "6972b9d8fbe6"
},
"source": [
"## Looking for info in the PDFs\n",
"\n",
"The variable `query` contains the information you want to retrieve from the PDF files."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "kKLnKXHKHxe4"
},
"outputs": [],
"source": [
"query = \"How many types of regular Train Car cards are there?\"\n",
"answer, relevant_chunks = query_documents(query)\n",
"\n",
"print(f\"Query: {query}\\n\\n-----\\n\")\n",
"print(f\"Generated answer: {answer}\\n\\n-----\\n\")\n",
"print(\"Relevant chunks:\")\n",
"for chunk in relevant_chunks:\n",
" print(f\"Document: {chunk['document']}\")\n",
" print(f\"Chunk: {chunk['chunk']}\".replace(\"\\n\", \"\"))\n",
" print(f\"Distance: {chunk['distance']}\")\n",
" print()"
]
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"name": "[Gemma_2]RAG_PDF_Search_in_multiple_documents_on_Colab.ipynb",
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}